CKG: Dynamic Representation Based on Context and Knowledge Graph
Recently, neural language representation models pre-trained on large corpora
have been shown to capture rich co-occurrence information and can be fine-tuned
on downstream tasks to improve performance. As a result, they have achieved
state-of-the-art results on a wide range of language tasks. However, there
exists other valuable semantic information, such as similar, opposite, or other
possible meanings, in external knowledge graphs (KGs). We argue that entities in
KGs could be used to enhance the correct semantic meaning of language
sentences. In this paper, we propose a new method, CKG: Dynamic Representation
Based on Context and Knowledge Graph. On the one hand, CKG can extract rich
semantic information from large corpora. On the other hand, it can make full
use of inside information, such as co-occurrence in large corpora, and outside
information, such as similar entities in KGs. We conduct
extensive experiments on a wide range of tasks, including QQP, MRPC, SST-5,
SQuAD, CoNLL 2003, and SNLI. The experimental results show that CKG achieves a
state-of-the-art score of 89.2 on SQuAD, compared with SAN (84.4), ELMo (85.8),
and BERT (88.5).
App Review Driven Collaborative Bug Finding
Software development teams generally welcome any effort to expose bugs in
their code base. In this work, we build on the hypothesis that mobile apps from
the same category (e.g., two web browser apps) may be affected by similar bugs
in their evolution process. It is therefore possible to transfer the experience
of one historical app to quickly find bugs in its new counterparts. This has
been referred to as collaborative bug finding in the literature. Our novelty is
that we guide the bug finding process by considering that existing bugs have
been hinted at within app reviews. Concretely, we design the BugRMSys approach to
recommend bug reports for a target app by matching historical bug reports from
apps in the same category with user app reviews of the target app. We
experimentally show that this approach enables us to quickly expose and report
dozens of bugs for targeted apps such as Brave (web browser app). BugRMSys's
implementation relies on DistilBERT to produce natural language text
embeddings. Our pipeline considers similarities between bug reports and app
reviews to identify relevant bugs. We then focus on the app review as well as
potential reproduction steps in the historical bug report (from a same-category
app) to reproduce the bugs.
Overall, after applying BugRMSys to six popular apps, we were able to
identify, reproduce, and report 20 new bugs: among these, 9 reports have
already been triaged, 6 were confirmed, and 4 have been fixed by the official
development teams.
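The matching step described above (comparing bug-report embeddings with app-review embeddings) can be sketched as follows. This is a minimal illustration, not the BugRMSys implementation: it assumes the embeddings have already been computed (for example by DistilBERT) and are plain lists of floats, and the function names, the use of cosine similarity, and the `threshold` value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def match_reports_to_reviews(report_vecs, review_vecs, threshold=0.8):
    """Return (score, report_id, review_id) triples, best match first.

    report_vecs: historical bug reports from same-category apps (id -> vector).
    review_vecs: user reviews of the target app (id -> vector).
    threshold:   hypothetical cut-off; pairs scoring below it are dropped.
    """
    matches = []
    for report_id, report_vec in report_vecs.items():
        for review_id, review_vec in review_vecs.items():
            score = cosine(report_vec, review_vec)
            if score >= threshold:
                matches.append((score, report_id, review_id))
    matches.sort(reverse=True)
    return matches
```

Each returned pair is a candidate: the review hints at a symptom in the target app, and the matched historical report may supply reproduction steps to try.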
Patch-CLIP: A Patch-Text Pre-Trained Model
In recent years, patch representation learning has emerged as a necessary
research direction for exploiting the capabilities of machine learning in
software generation. These representations have driven significant performance
enhancements across a variety of tasks involving code changes. While the
progress is undeniable, a common limitation among existing models is their
specialization: they predominantly excel in either predictive tasks, such as
security patch classification, or in generative tasks such as patch description
generation. This dichotomy is further exacerbated by a prevalent dependency on
potentially noisy data sources. Specifically, many models utilize patches
integrated with Abstract Syntax Trees (AST) that, unfortunately, may contain
parsing inaccuracies, thus acting as a suboptimal source of supervision. In
response to these challenges, we introduce PATCH-CLIP, a novel pre-training
framework for patches and natural language text. PATCH-CLIP deploys a
triple-loss training strategy for 1) patch-description contrastive learning,
which makes it possible to separate patches and descriptions in the embedding space, 2)
patch-description matching, which ensures that each patch is associated with its
description in the embedding space, and 3) patch-description generation, which
ensures that the patch embedding is effective for generation. These losses are
implemented for joint learning to achieve good performance in both predictive
and generative tasks involving patches. Empirical evaluations focusing on patch
description generation demonstrate that PATCH-CLIP sets a new state of the art,
consistently outperforming prior approaches on metrics such as
BLEU, ROUGE-L, METEOR, and Recall.
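To make the contrastive part of the triple-loss strategy concrete, here is a minimal sketch of a batch contrastive (InfoNCE-style) loss and of summing three loss terms for joint learning. The abstract does not give the exact formulation, so the InfoNCE form, the `temperature` value, the equal `weights`, and all function names are assumptions for illustration only. Embeddings are assumed to be unit-normalized lists of floats.

```python
import math


def contrastive_loss(patch_vecs, text_vecs, temperature=0.1):
    """InfoNCE-style loss: patch i should score highest with description i.

    patch_vecs[i] and text_vecs[i] are assumed to be a matching pair; every
    other description in the batch serves as a negative for patch i.
    """
    total = 0.0
    for i, patch in enumerate(patch_vecs):
        # Temperature-scaled dot products against every description.
        logits = [
            sum(a * b for a, b in zip(patch, text)) / temperature
            for text in text_vecs
        ]
        # Numerically stable log-sum-exp over the row.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(patch_vecs)


def joint_loss(l_contrastive, l_matching, l_generation, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three losses (equal weights are an assumption)."""
    w_c, w_m, w_g = weights
    return w_c * l_contrastive + w_m * l_matching + w_g * l_generation
```

With correctly paired embeddings the contrastive term is close to zero, and it grows when patches are closer to the wrong descriptions than to their own.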
Delving into Commit-Issue Correlation to Enhance Commit Message Generation Models
Commit message generation (CMG) is a challenging task in automated software
engineering that aims to generate natural language descriptions of code changes
for commits. Previous methods all start from the modified code snippets,
outputting commit messages through template-based, retrieval-based, or
learning-based models. While these methods can summarize what is modified from
the perspective of code, they struggle to provide reasons for the commit. The
correlation between commits and issues that could be a critical factor for
generating rational commit messages is still unexplored.
In this work, we delve into the correlation between commits and issues from
the perspective of dataset and methodology. We construct the first dataset
anchored on combining correlated commits and issues. The dataset consists of an
unlabeled commit-issue parallel part and a labeled part in which each example
is provided with human-annotated rational information in the issue.
Furthermore, we propose ExGroFi (Extraction, Grounding, Fine-tuning), a novel
paradigm that can introduce the correlation
between commits and issues into the training phase of models. To evaluate
whether it is effective, we perform comprehensive experiments with various
state-of-the-art CMG models. The results show that compared with the original
models, the performance of ExGroFi-enhanced models is significantly improved.
Comment: ASE2023 accepted paper
Just-in-Time Security Patch Detection -- LLM At the Rescue for Data Augmentation
In the face of growing vulnerabilities found in open-source software, the need to identify discreet security patches has become paramount. The lack of consistency in how software providers handle maintenance often leads to the release of security patches without comprehensive advisories, leaving users vulnerable to unaddressed security risks. To address this pressing issue, we introduce a novel security patch detection system, LLMDA, which capitalizes on Large Language Models (LLMs) and code-text alignment methodologies for patch review, data enhancement, and feature combination. Within LLMDA, we first use LLMs to examine patches and expand the data of PatchDB and SPI-DB, two security patch datasets from recent literature. We then use labeled instructions to direct LLMDA, differentiating patches based on security relevance. Following this, we apply a PTFormer to merge patches with code, formulating hybrid attributes that encompass both the innate details and the interconnections between the patches and the code. This distinctive combination method allows our system to capture more insights from the combined context of patches and code, hence improving detection precision. Finally, we devise a probabilistic batch contrastive learning mechanism to augment the capability of LLMDA in discerning security patches. The results reveal that LLMDA significantly surpasses state-of-the-art techniques in detecting security patches, underscoring its promise in fortifying software maintenance.
Learning to Represent Patches
Patch representation is crucial in automating various software engineering
tasks, like determining patch accuracy or summarizing code changes. While
recent research has employed deep learning for patch representation, focusing
on token sequences or Abstract Syntax Trees (ASTs), existing approaches often
miss the change's semantic intent and the context of modified lines. To bridge this gap,
we introduce a novel method, Patcherizer. It delves into the intentions of
context and structure, merging the surrounding code context with two innovative
representations. These capture the intention in code changes and the intention
in AST structural modifications pre- and post-patch. This holistic
representation aptly captures a patch's underlying intentions. Patcherizer
employs graph convolutional neural networks for structural intention graph
representation and transformers for intention sequence representation. We
evaluated Patcherizer's embeddings' versatility in three areas: (1) Patch
description generation, (2) Patch accuracy prediction, and (3) Patch intention
identification. Our experiments demonstrate the representation's efficacy
across all tasks, outperforming state-of-the-art methods. For example, in patch
description generation, Patcherizer excels, showing an average boost of 19.39%
in BLEU, 8.71% in ROUGE-L, and 34.03% in METEOR scores.
Hyperbolic Code Retrieval: A Novel Approach for Efficient Code Search Using Hyperbolic Space Embeddings
Within the realm of advanced code retrieval, existing methods have primarily relied on intricate matching and attention-based mechanisms. However, these methods often lead to computational and memory inefficiencies, posing a significant challenge to their real-world applicability. To tackle this challenge, we propose a novel approach, Hyperbolic Code QA Matching (HyCoQA). This approach leverages the unique properties of hyperbolic space to express connections between code fragments and their corresponding queries, thereby obviating the need for intricate interaction layers. The process begins by reframing the code retrieval challenge within a question-answering (QA) matching framework, constructing a dataset with triple matches characterized as \texttt{}. These matches are subsequently processed via a static BERT embedding layer, yielding initial embeddings. Thereafter, a hyperbolic embedder transforms these representations into hyperbolic space, calculating distances between the codes and descriptions. The process concludes by applying a scoring layer to these distances and using hinge loss for model training. Notably, the design of HyCoQA inherently facilitates self-organization, allowing for the automatic detection of embedded hierarchical patterns during the learning phase. Experimentally, HyCoQA shows an average performance improvement of 3.5% to 4% compared to state-of-the-art code retrieval techniques.
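The distance-then-hinge-loss step can be illustrated with a small sketch. The abstract does not say which model of hyperbolic space is used, so the Poincaré ball distance below is an assumption, as are the function names and the `margin` value. Inputs are assumed to be points strictly inside the unit ball.

```python
import math


def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # Clamp the denominator so points very close to the boundary stay finite.
    denominator = max((1.0 - sq_u) * (1.0 - sq_v), eps)
    return math.acosh(1.0 + 2.0 * sq_diff / denominator)


def hinge_loss(dist_positive, dist_negative, margin=1.0):
    """Zero once the matching pair is closer than the non-matching pair
    by at least `margin`; otherwise grows linearly with the violation."""
    return max(0.0, margin + dist_positive - dist_negative)
```

For a query embedding `q`, a matching code embedding `pos`, and a non-matching one `neg`, the per-triple loss would be `hinge_loss(poincare_distance(q, pos), poincare_distance(q, neg))`.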
Multilevel Semantic Embedding of Software Patches: A Fine-to-Coarse Grained Approach Towards Security Patch Detection
The growth of open-source software has increased the risk of hidden vulnerabilities that can affect downstream software applications. This concern is further exacerbated by software vendors' practice of silently releasing security patches without explicit warnings or common vulnerability and exposure (CVE) notifications. This lack of transparency leaves users unaware of potential security threats, giving attackers an opportunity to take advantage of these vulnerabilities. In the complex landscape of software patches, grasping the nuanced semantics of a patch is vital for ensuring secure software maintenance. To address this challenge, we introduce a multilevel Semantic Embedder for security patch detection, termed MultiSEM. This model harnesses word-centric vectors at a fine-grained level, emphasizing the significance of individual words, while the coarse-grained layer adopts entire code lines for vector representation, capturing the essence and interrelation of added or removed lines. We further enrich this representation by assimilating patch descriptions to obtain a holistic semantic portrait. This combination of multi-layered embeddings offers a robust representation, balancing word complexity, code-line insights, and patch descriptions. When evaluated on security patch detection, MultiSEM outperforms state-of-the-art models with promising margins: a 22.46% improvement on PatchDB and a 9.21% improvement on SPI-DB in terms of the F1 metric.
Enhancing Text-to-SQL Translation for Financial System Design
Text-to-SQL, the task of translating natural language questions into SQL queries, is part of various business processes. Its automation, which is an emerging challenge, will empower software practitioners to seamlessly interact with relational databases using natural language, thereby bridging the gap between business needs and software capabilities. In this paper, we consider Large Language Models (LLMs), which have achieved state-of-the-art results on various NLP tasks. Specifically, we benchmark Text-to-SQL performance, the evaluation methodologies, and input optimization (e.g., prompting). In light of our empirical observations, we propose two novel metrics designed to adequately measure the similarity between SQL queries. Overall, we share various findings with the community, notably on how to select the right LLM for Text-to-SQL tasks. We further demonstrate that a tree-based edit distance constitutes a reliable metric for assessing the similarity between generated SQL queries and the oracle when benchmarking Text-to-SQL approaches. This metric is important because it relieves researchers of the need to perform computationally expensive experiments, such as executing generated queries as done in prior work. Our work implements financial domain use cases and therefore contributes to the advancement of Text-to-SQL systems and their practical adoption in this domain.